An Analysis of Tennis Data

By: Alan Zhang

To do list:

-- Explanation on technological improvements. See 3A)

-- Court differences 3A)

Table of Contents

  1. Introduction
  2. Data Collection
  3. Exploratory Analysis
     3a. Trends over time
  4. Making a model (unsuccessfully)
     4a. Matrix Plot
     4b. Backwards Selection
  5. Conclusion and Final Thoughts

1. Introduction


I analyzed ATP Tour matches to practice statistics and computer science, and to learn more about a sport I'm passionate about. I scraped the data directly off the internet, without downloading a single file, and ran statistical analyses on my favorite players. Working with the data revealed interesting trends across the history of tennis, which I explore below.

2. Data Collection

We want to set up our data to look at specific statistics such as how stats differ for losers and winners on different courts, as well as statistics of top players over time.

Our code first sets up arrays to store these statistics, and then scrapes the data off of a publicly available webpage. I used this dataset: https://github.com/JeffSackmann/tennis_atp/blob/master/README.md

Since the dataset is stored as one file of matches per year, I combined all the years into one big dataset so I could analyze trends over time. While combining, I extracted yearly averages of the statistics I specifically wanted to look at.
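The combining step can be sketched as follows. The toy frames below stand in for the per-year CSVs, and only the winner-ace column (`w_ace`) is shown:

```python
import pandas as pd

# Toy stand-ins for the per-year match tables (the real ones come
# from atp_matches_<year>.csv); only the winner-ace column is shown.
frames = {
    1968: pd.DataFrame({"w_ace": [1.0, 3.0]}),
    1969: pd.DataFrame({"w_ace": [4.0, 6.0]}),
}

# Extract a yearly average while combining everything into one frame.
yearly_aces = [frames[year]["w_ace"].mean() for year in sorted(frames)]
combined = pd.concat(frames.values(), ignore_index=True)

print(yearly_aces)    # [2.0, 5.0]
print(len(combined))  # 4
```

The real code below does the same thing, just with many more columns and one `read_csv` call per year.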

In [13]:
# Necessary Imports
import pandas as pd
import datetime
import matplotlib.pyplot as plt
# Defining date parser for our data
def parse(t):
    ret = []
    for ts in t:
        try:
            string = str(ts)
            tsdt = datetime.date(int(string[:4]), int(string[4:6]), int(string[6:]))
        except (TypeError, ValueError):
            # Fall back to a sentinel date for malformed timestamps
            tsdt = datetime.date(1900, 1, 1)
        ret.append(tsdt)
    return ret

# Read in the first year of the data
df = pd.read_csv("https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_1968.csv", index_col=None,
                         header=0,
                         parse_dates=[5],
                         encoding = "ISO-8859-1",
                         date_parser=lambda t:parse(t))

# Create lists to store yearly aces of winners
yearlyAces = []
yearlyAces.append(df['w_ace'].mean())

# Create lists to store yearly aces of winners on different courts
yearlyAcesHard = []
yearlyAcesGrass = []
yearlyAcesClay = []

# Cut the dataframes on different courts
hardSurface = df.loc[df['surface'] == "Hard"]
grass = df.loc[df['surface'] == "Grass"]
clay = df.loc[df['surface'] == "Clay"]

# Add the first average yearly winners aces on different courts
yearlyAcesHard.append(hardSurface['w_ace'].mean())
yearlyAcesGrass.append(grass['w_ace'].mean())
yearlyAcesClay.append(clay['w_ace'].mean())

# Same thing for losers
yearlyLosersAces = []
yearlyLosersAces.append(df['l_ace'].mean())
hardSurface = df.loc[df['surface'] == "Hard"]
grass = df.loc[df['surface'] == "Grass"]
clay = df.loc[df['surface'] == "Clay"]

yearlyAcesHardL = []
yearlyAcesGrassL = []
yearlyAcesClayL = []
yearlyAcesHardL.append(hardSurface['l_ace'].mean())
yearlyAcesGrassL.append(grass['l_ace'].mean())
yearlyAcesClayL.append(clay['l_ace'].mean())

# Setting up data to look at the number of US players who finished in the top 4
USnumOfTop4 = []
top4 = df.loc[(df['round'] == "F") | (df['round'] == "SF") | (df['tourney_level'] == "G")]
USnumOfTop4.append(top4['winner_ioc'].value_counts()["USA"])

# Setting up data to look at the number of AUS, ESP, and FRA players who finished in the top 4
AUSnumOfTop4 = []
AUSnumOfTop4.append(top4['winner_ioc'].value_counts()["AUS"])
ESPnumOfTop4 = []
ESPnumOfTop4.append(top4['winner_ioc'].value_counts()["ESP"])
FRAnumOfTop4 = []
FRAnumOfTop4.append(top4['winner_ioc'].value_counts()["FRA"])


averageWinnersHeight = []
averageWinnersHeight.append(df['winner_ht'].mean())
averageLosersHeight = []
averageLosersHeight.append(df['loser_ht'].mean())

top8 = df.loc[(df['round'] == "F") | (df['round'] == "SF") | (df['tourney_level'] == "G") | (df['round'] == "QF")]
averageWinnersHeightTop8 = []
averageWinnersHeightTop8.append(top8['winner_ht'].mean())
averageLosersHeightTop8 = []
averageLosersHeightTop8.append(top8['loser_ht'].mean())

averageWinnersAgeTop8 = []
averageWinnersAgeTop8.append(top8['winner_age'].mean())
averageLosersAgeTop8 = []
averageLosersAgeTop8.append(top8['loser_age'].mean())

doubleFaultsW = []
doubleFaultsW.append(df['w_df'].mean())
doubleFaultsL = []
doubleFaultsL.append(df['l_df'].mean())

doubleFaultsWTop8 = []
doubleFaultsWTop8.append(top8['w_df'].mean())
doubleFaultsLTop8 =[]
doubleFaultsLTop8.append(top8['l_df'].mean())

for i in range(1969, 2022):
  url = "https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_" + str(i) + ".csv"
  df1 = pd.read_csv(url, index_col=None,
                         header=0,
                         parse_dates=[5],
                         encoding = "ISO-8859-1",
                         date_parser=lambda t:parse(t))
  df = pd.concat([df,df1])
  yearlyAces.append(df1['w_ace'].mean())
  hardSurface = df1.loc[df1['surface'] == "Hard"]
  grass = df1.loc[df1['surface'] == "Grass"]
  clay = df1.loc[df1['surface'] == "Clay"]

  yearlyAcesHard.append(hardSurface['w_ace'].mean())
  yearlyAcesGrass.append(grass['w_ace'].mean())
  yearlyAcesClay.append(clay['w_ace'].mean())

  yearlyLosersAces.append(df1['l_ace'].mean())
  yearlyAcesHardL.append(hardSurface['l_ace'].mean())
  yearlyAcesGrassL.append(grass['l_ace'].mean())
  yearlyAcesClayL.append(clay['l_ace'].mean())

  top4 = df1.loc[(df1['round'] == "F") | (df1['round'] == "SF") | (df1['tourney_level'] == "G")]
  USnumOfTop4.append(top4['winner_ioc'].value_counts()["USA"])
  AUSnumOfTop4.append(top4['winner_ioc'].value_counts()["AUS"])
  ESPnumOfTop4.append(top4['winner_ioc'].value_counts()["ESP"])
  FRAnumOfTop4.append(top4['winner_ioc'].value_counts()["FRA"])

  averageWinnersHeight.append(df1['winner_ht'].mean())
  averageLosersHeight.append(df1['loser_ht'].mean())

  top8 = df1.loc[(df1['round'] == "F") | (df1['round'] == "SF") | (df1['tourney_level'] == "G") | (df1['round'] == "QF")]
  averageWinnersHeightTop8.append(top8['winner_ht'].mean())
  averageLosersHeightTop8.append(top8['loser_ht'].mean())

  averageWinnersAgeTop8.append(top8['winner_age'].mean())
  averageLosersAgeTop8.append(top8['loser_age'].mean())

  doubleFaultsW.append(df1['w_df'].mean())
  doubleFaultsL.append(df1['l_df'].mean())

  doubleFaultsWTop8.append(top8['w_df'].mean())
  doubleFaultsLTop8.append(top8['l_df'].mean())


combined = df
display(combined)
[Output: the combined DataFrame, 180305 rows × 49 columns. Columns include tourney_id, tourney_name, surface, draw_size, tourney_level, tourney_date, match_num, winner/loser ids, seeds, names, hands, heights, countries, and ages, plus score, best_of, round, minutes, per-match serve statistics for both winner and loser (aces, double faults, serve points, first serves in/won, second serves won, service games, break points saved/faced), and rank/rank points.]

3a) Trends Over Time

After creating the big dataset and extracting yearly averages, I graph the statistics I collected to observe some trends.

In [4]:
years = range(1968,2022)
plt.plot(years, yearlyAces, label = "Total")
plt.xlabel("Years")
plt.ylabel("Average Number of Winner's Aces")
plt.title("Number of Winner's Aces Over Time")
plt.plot(years, yearlyAcesHard, label = "Hard")
plt.plot(years, yearlyAcesGrass, label = "Grass")
plt.plot(years, yearlyAcesClay, label = "Clay")
plt.legend()

This is a graph of the number of aces a winner scores over time on different courts. As we can see, the general trend is upwards. This could be attributed to a multitude of reasons.

One such reason could be technological improvements: modern racquets and strings allow faster serves, while returning skill may lag behind the extra pace.

Another trend we can see is that aces are much more common on grass and rarer on clay. Clay is a slower surface with a higher bounce, which gives returners more time to reach the ball.

One outlier to note is the spike at 2020. Due to COVID, only two matches were played on grass that year, so the 2020 grass value is not meaningful.
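A guard against such low-sample spikes is to mask any yearly average computed from too few matches before plotting. A minimal sketch, with toy data and a hypothetical cutoff of 10 matches:

```python
import numpy as np
import pandas as pd

# Toy data: 50 matches in 2019, only 2 in 2020 (like grass during COVID).
matches = pd.DataFrame({
    "year": [2019] * 50 + [2020] * 2,
    "w_ace": np.r_[np.full(50, 6.0), [20.0, 22.0]],
})

# Mask yearly means backed by fewer than 10 matches before plotting.
grouped = matches.groupby("year")["w_ace"].agg(["mean", "count"])
grouped.loc[grouped["count"] < 10, "mean"] = np.nan
print(grouped["mean"].tolist())  # [6.0, nan]
```

Matplotlib skips NaN points, so masked years simply leave a gap in the line instead of a misleading spike.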

In [6]:
plt.plot(years, yearlyLosersAces, label = "Total")
plt.xlabel("Years")
plt.ylabel("Average Number of Loser's Aces")
plt.title("Number of Loser's Aces Over Time")
plt.plot(years, yearlyAcesHardL, label = "Hard")
plt.plot(years, yearlyAcesGrassL, label = "Grass")
plt.plot(years, yearlyAcesClayL, label = "Clay")
plt.legend()

Here, we plot the same graph as above, but for the losing side of each match. We can see the same trends: aces rise over time and vary by court, but the overall number of aces per match is lower than the winners', which is to be expected.

Although this graph doesn't give us any new information, it's reassuring that the data makes sense: losers have worse stats than winners.

Next, I was interested in how strong the top regions were from around the world. I decided to measure this by seeing how many players from each top region made it to the top 4 in Grand Slam tournaments.
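One caveat with the counting approach used below: `value_counts()["FRA"]` raises a `KeyError` in any year where no player from that country appears in the filtered rounds. `Series.get` with a default is a safer lookup (toy data shown):

```python
import pandas as pd

winner_ioc = pd.Series(["USA", "USA", "AUS"])  # toy winner countries
counts = winner_ioc.value_counts()

# counts["FRA"] would raise KeyError here; .get returns a default instead.
print(counts.get("USA", 0))  # 2
print(counts.get("FRA", 0))  # 0
```

The dataset happens to have every tracked country represented in every year, so the direct indexing below works, but `.get` would make the loop robust to other country choices.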

In [15]:
# Where good players come from over time
plt.plot(years, USnumOfTop4, label = 'US Players')
plt.plot(years, AUSnumOfTop4, label = 'Australian Players')
plt.plot(years, ESPnumOfTop4, label = 'Spanish Players')
plt.plot(years, FRAnumOfTop4, label = 'French Players')
plt.title("Number of Players From Top Regions Who Made the Top 4 in Grand Slams")

plt.xlabel("Years")
plt.ylabel("Number of Players At top 4 in Grand Slams")
plt.legend()

Looking at the graph, the trend is clear. The US used to dominate in the early years, but nearing the turn of the century, France and Spain in particular overtook the US in the number of athletes making the top 4 at Grand Slams.

-- Insert speculation as to why here --

A general interest in sports is the physical makeup of top athletes. What makes a good basketball player? The obvious answer is height. I wondered the same about tennis, so I analyzed the heights of winners and losers of every match over time.

In [16]:
# Heights of Losers VS Winners Over Time
plt.plot(years, averageWinnersHeight, label = 'Winners')
plt.plot(years, averageLosersHeight, label = 'Losers')
plt.title("Heights of Winners and Losers over Time")
plt.xlabel("Years")
plt.ylabel("Average Height (CM)")
plt.legend()

Since I'm plotting the averages of many, many matches, the difference could be negligible, but the graph shows that winners are generally taller, which lends some insight into height giving a slight advantage in tennis.
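One way to check whether such a gap is more than noise is a two-sample t statistic. Here is a minimal numpy sketch on toy heights; the means and sample sizes are made up for illustration, not the real values:

```python
import numpy as np

rng = np.random.default_rng(1)
winner_ht = rng.normal(186, 7, 500)  # toy samples, not the real heights
loser_ht = rng.normal(183, 7, 500)

# Welch's t statistic: |t| well above ~2 suggests the gap is real.
diff = winner_ht.mean() - loser_ht.mean()
se = np.sqrt(winner_ht.var(ddof=1) / 500 + loser_ht.var(ddof=1) / 500)
t_stat = diff / se
print(abs(t_stat) > 2.0)  # True for this toy effect size
```

Running the same test on the real `winner_ht`/`loser_ht` columns would quantify whether the visual gap in the plot is statistically meaningful.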

To narrow down the dataset a little and look at the peak of the sport, I decided to look at the heights of the Top 8 only to see if height mattered more or less at the highest level of competition.

In [22]:
# Heights of Losers VS Winners Over Time In Top 8
plt.plot(years, averageWinnersHeightTop8, label = 'Winners')
plt.plot(years, averageLosersHeightTop8, label = 'Losers')
plt.title("Heights of Winners and Losers in Top 8 over Time")

plt.xlabel("Years")
plt.ylabel("Average Height (CM)")
plt.legend()

-- Why this could be --

On a similar note, age plays a different role in different sports. I wondered how much age matters in tennis, so I plotted it.

In [18]:
# Ages of Losers VS Winners Over Time In Top 8
plt.plot(years, averageWinnersAgeTop8, label = 'Winners')
plt.plot(years, averageLosersAgeTop8, label = 'Losers')
plt.title("Ages of Winners and Losers in Top 8 over Time")

plt.xlabel("Years")
plt.ylabel("Average Age (Years)")
plt.legend()

Although the graph looks like there is a downward trend, with players becoming younger, the age range is very small: most averages fall between 25.5 and 27 years old. This graph tells us very little.

-- Can find historical evidence of tennis being an old persons game --

Diving more into specific statistics, I was interested in the number of double faults over time, as I thought it should just get better over time as players become more consistent.

In [19]:
# Double Faults of Losers VS Winners Over Time
plt.plot(years, doubleFaultsW, label = 'Winners')
plt.plot(years, doubleFaultsL, label = 'Losers')
plt.title("Double Faults of Winners and Losers over Time")

plt.xlabel("Years")
plt.ylabel("Average Double Faults")
plt.legend()

There are three things to analyze here.

From the graph we can see that the losers' distribution has the same shape as the winners' but is shifted up by about 0.8.

The clear takeaway is that the losers' line is shifted up: losers serve more double faults, and double faults contribute to a loss.

An interesting and less obvious takeaway is why the losers' curve tracks the winners' curve so closely. It could have something to do with the fact that we are comparing winners and losers drawn from the same matches; with the way we combined the data, a player can contribute to both the winners' and losers' averages.
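The overlap is easy to see in a toy example: with two matches between the same two players, each player ends up in both averages.

```python
import pandas as pd

# Two matches between the same players: A beats B, then B beats A.
matches = pd.DataFrame({
    "winner_name": ["A", "B"],
    "loser_name":  ["B", "A"],
    "w_df": [2.0, 3.0],
    "l_df": [4.0, 3.0],
})

# Both A and B contribute to the winners' AND the losers' averages,
# so the two curves are computed over overlapping player pools.
overlap = set(matches["winner_name"]) & set(matches["loser_name"])
print(sorted(overlap))  # ['A', 'B']
```

Because the two curves average over largely the same pool of players in the same seasons, it is not surprising they share a shape.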

The last takeaway comes from the trend over time: both losers and winners started serving more double faults around 1997, then trended down, and then back up.

-- Speculation -- Around 2005-06, players wanted more consistent serves, not giving up free points. After achieving consistency, players tried to regain speed, playing more high-risk, high-reward serves, resulting in more double faults.

Since this trend was so interesting, I decided to take a closer look at the highest level of competition. Here is a graph of double faults of winners vs losers from top 8 matches.

In [21]:
# Double Faults of Losers VS Winners Over Time In Top 8 Matches
plt.plot(years, doubleFaultsWTop8, label = 'Winners')
plt.plot(years, doubleFaultsLTop8, label = 'Losers')
plt.title("Double Faults of Winners and Losers in Top 8 over Time")

plt.xlabel("Years")
plt.ylabel("Average Double Faults")
plt.legend()

In this graph the trend is even more exact: losers of top 8 matches serve 0.8 to 0.9 more double faults than winners of these matches. It is puzzling why the gap is so consistent.

4) Making a model

I decided to try to model what makes a good player by predicting player winrate from the available statistics. The first step is extracting each player's data from the match data.
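The winrate bookkeeping reduces to two `value_counts` lookups per player. A toy sketch, without the 100-match filter used below:

```python
import pandas as pd

matches = pd.DataFrame({
    "winner_name": ["A", "A", "B"],
    "loser_name":  ["B", "C", "A"],
})

wins = matches["winner_name"].value_counts()
losses = matches["loser_name"].value_counts()

win_rates = {}
for player in matches["winner_name"].unique():
    total = wins.get(player, 0) + losses.get(player, 0)
    win_rates[player] = wins.get(player, 0) / total

print(win_rates["B"])  # 0.5  (1 win, 1 loss)
```

Iterating over `winner_name.unique()` does skip players who never won a match, but such players would fail the 100-match/winrate analysis anyway.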

In [34]:
listOfPlayers = combined['winner_name'].unique()

winCounts = combined['winner_name'].value_counts()
lossCounts = combined['loser_name'].value_counts()
winRates = []
for player in listOfPlayers:
  totalGames = winCounts.get(player, default = 0) + lossCounts.get(player, default = 0)
  # We filter out people who have played less than 100 games
  if totalGames < 100:
    continue
  else:
    wins = combined.loc[(combined['winner_name'] ==  player)]
    winAvgs = wins.mean(numeric_only=True)
    losses = combined.loc[(combined['loser_name'] ==  player)]
    lossAvgs = losses.mean(numeric_only=True)
    winRates.append((player, winCounts.get(player, default = 0)/totalGames, winAvgs['minutes'], winAvgs['w_ace'], winAvgs['w_df'], winAvgs['w_svpt'] \
                     , winAvgs['w_1stIn'], winAvgs['w_1stWon'], winAvgs['w_2ndWon'], winAvgs['w_SvGms'], winAvgs['w_bpSaved'], winAvgs['w_bpFaced'], lossAvgs['minutes']\
                     , lossAvgs['l_ace'], lossAvgs['l_df'], lossAvgs['l_svpt'], lossAvgs['l_1stIn'], lossAvgs['l_1stWon'], lossAvgs['l_2ndWon'], lossAvgs['l_SvGms']\
                     , lossAvgs['l_bpSaved'], lossAvgs['l_bpFaced']))

Now I can combine this all into its own dataframe.

In [25]:
playersDataframe = pd.DataFrame.from_records(winRates, columns =['Player', 'Winrate','Winning Minutes', 'Winning Aces' , 'Winning Double Fault', 'Winning Serve Point', 'Winning First In', 'Winning First Point Won',\
                                                   'Winning Second Point Won', 'Winning Serve Games', 'Winning Break Points Saved', 'Winning Break Points Faced','Losing Minutes','Losing Aces' ,\
                                                   'Losing Double Fault', 'Losing Serve Point', 'Losing First In', 'Losing First Point Won',\
                                                   'Losing Second Point Won', 'Losing Serve Games', 'Losing Break Points Saved', 'Losing Break Points Faced'])
playersDataframe.dropna(how = 'any', inplace = True)
display(playersDataframe)
[Output: playersDataframe, 559 rows × 22 columns — one row per player with at least 100 matches, holding the player's winrate and his averaged serve statistics in wins and in losses. Sample rows include Jimmy Connors (winrate 0.809), John McEnroe (0.815), and Jannik Sinner (0.653).]

4a) Matrix Plot

Now I can do a preliminary matrix plot to see if anything is correlated well with winrate.

In [27]:
import seaborn as sns
sns.pairplot(playersDataframe)

In this matrix plot, we would be looking for any variables that are correlated with winrate but not correlated with each other, so that when we use them in the model, they do not explain the same variance.

However, looking at the diagonal of our matrix plot, most of our stats look roughly normally distributed, and the scatter plots show that none of our variables correlates strongly with winrate. This does not bode well for the model.
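A numeric complement to eyeballing the pairplot is to rank the columns by their absolute correlation with winrate. A sketch on toy columns (on the real `playersDataframe` you would first drop the `Player` column; the constructed signal here is illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "Winrate": rng.normal(0.5, 0.1, n),
    "Winning Aces": rng.normal(5.0, 2.0, n),  # pure noise column
})
# Construct one column that genuinely depends on Winrate.
toy["Winning Minutes"] = 100 - 50 * toy["Winrate"] + rng.normal(0, 2, n)

# Absolute correlation of every predictor with Winrate, strongest first.
corrs = toy.corr()["Winrate"].drop("Winrate").abs().sort_values(ascending=False)
print(corrs.index[0])  # 'Winning Minutes' — the constructed signal
```

Sorting real correlations this way would make the "nothing correlates strongly" observation quantitative rather than visual.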

Nevertheless, I will attempt to continue with the model starting with a multiple regression model against all our statistics.

In [28]:
import statsmodels.api as sm
X = playersDataframe[['Winning Minutes', 'Winning Aces' , 'Winning Double Fault', 'Winning Serve Point', 'Winning First In', 'Winning First Point Won',\
                                                   'Winning Second Point Won', 'Winning Serve Games', 'Winning Break Points Saved', 'Winning Break Points Faced','Losing Minutes','Losing Aces' ,\
                                                   'Losing Double Fault', 'Losing Serve Point', 'Losing First In', 'Losing First Point Won',\
                                                   'Losing Second Point Won', 'Losing Serve Games', 'Losing Break Points Saved', 'Losing Break Points Faced']]
y = playersDataframe['Winrate']

X = sm.add_constant(X)
est = sm.OLS(y, X).fit()
est.summary()
Out[28]:
OLS Regression Results
Dep. Variable: Winrate R-squared: 0.216
Model: OLS Adj. R-squared: 0.187
Method: Least Squares F-statistic: 7.400
Date: Sun, 27 Feb 2022 Prob (F-statistic): 1.12e-18
Time: 22:21:13 Log-Likelihood: 566.48
No. Observations: 559 AIC: -1091.
Df Residuals: 538 BIC: -1000.
Df Model: 20
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 0.1040 0.110 0.946 0.345 -0.112 0.320
Winning Minutes -0.0027 0.001 -2.545 0.011 -0.005 -0.001
Winning Aces -0.0033 0.005 -0.603 0.547 -0.014 0.007
Winning Double Fault -0.0171 0.009 -1.820 0.069 -0.036 0.001
Winning Serve Point 0.0042 0.007 0.576 0.565 -0.010 0.018
Winning First In -0.0037 0.005 -0.712 0.477 -0.014 0.007
Winning First Point Won -0.0051 0.012 -0.432 0.666 -0.028 0.018
Winning Second Point Won -0.0078 0.014 -0.577 0.564 -0.034 0.019
Winning Serve Games 0.0381 0.029 1.337 0.182 -0.018 0.094
Winning Break Points Saved 0.0305 0.034 0.895 0.371 -0.036 0.098
Winning Break Points Faced -0.0368 0.033 -1.109 0.268 -0.102 0.028
Losing Minutes 0.0053 0.001 4.843 0.000 0.003 0.007
Losing Aces 0.0003 0.007 0.043 0.966 -0.013 0.014
Losing Double Fault 0.0148 0.008 1.765 0.078 -0.002 0.031
Losing Serve Point -0.0113 0.009 -1.266 0.206 -0.029 0.006
Losing First In -0.0023 0.006 -0.410 0.682 -0.013 0.009
Losing First Point Won 0.0264 0.017 1.553 0.121 -0.007 0.060
Losing Second Point Won 0.0174 0.018 0.977 0.329 -0.018 0.052
Losing Serve Games -0.0335 0.034 -0.982 0.327 -0.100 0.033
Losing Break Points Saved -0.0896 0.049 -1.836 0.067 -0.185 0.006
Losing Break Points Faced 0.0763 0.047 1.634 0.103 -0.015 0.168
Omnibus: 31.808 Durbin-Watson: 1.650
Prob(Omnibus): 0.000 Jarque-Bera (JB): 35.940
Skew: 0.578 Prob(JB): 1.57e-08
Kurtosis: 3.453 Cond. No. 6.03e+03


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 6.03e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

4b) Backwards Selection

I'm going to use backward selection in an attempt to improve my model.

Backward selection means iteratively pruning high p-value predictors out of the model, since a high p-value indicates a predictor adds little. However, the model looks doomed because of the low R-squared value.

In spite of this, we will continue.
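The pruning loop itself can be automated. Here is a numpy-only sketch that repeatedly drops the weakest predictor by |t| statistic until everything left clears a rough alpha = 0.05 cutoff (|t| >= 2); the function name, cutoff, and toy data are illustrative, not the notebook's exact procedure:

```python
import numpy as np

def backward_select(columns, y, t_min=2.0):
    """Iteratively drop the predictor with the smallest |t| statistic
    until all remaining predictors have |t| >= t_min (~alpha of 0.05)."""
    names = list(columns)
    while names:
        # OLS fit with an intercept on the surviving predictors.
        A = np.column_stack([np.ones(len(y))] + [columns[n] for n in names])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        sigma2 = resid @ resid / (len(y) - A.shape[1])
        se = np.sqrt(np.diag(sigma2 * np.linalg.inv(A.T @ A)))
        t = np.abs(beta[1:] / se[1:])  # skip the intercept's t
        if t.min() >= t_min:
            break
        names.pop(int(np.argmin(t)))
    return names

# Toy data: y depends on x1; x2 is pure noise and should usually be pruned.
rng = np.random.default_rng(42)
x1, x2 = rng.normal(size=300), rng.normal(size=300)
y = 2.0 * x1 + rng.normal(scale=0.5, size=300)
print(backward_select({"x1": x1, "x2": x2}, y))
```

In the cells below I do the pruning by hand in two rounds, which amounts to the same procedure.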

In [29]:
X = playersDataframe[['Winning Minutes','Winning Double Fault', 'Winning Serve Games', 'Winning Break Points Saved', 'Winning Break Points Faced','Losing Minutes',\
                                                   'Losing Double Fault', 'Losing Serve Point', 'Losing First Point Won',\
                                                   'Losing Second Point Won', 'Losing Serve Games', 'Losing Break Points Saved', 'Losing Break Points Faced']]
y = playersDataframe['Winrate']

X = sm.add_constant(X)
est = sm.OLS(y, X).fit()
est.summary()
Out[29]:
OLS Regression Results
Dep. Variable: Winrate R-squared: 0.212
Model: OLS Adj. R-squared: 0.193
Method: Least Squares F-statistic: 11.26
Date: Sun, 27 Feb 2022 Prob (F-statistic): 1.36e-21
Time: 22:22:48 Log-Likelihood: 565.05
No. Observations: 559 AIC: -1102.
Df Residuals: 545 BIC: -1042.
Df Model: 13
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 0.1100 0.108 1.018 0.309 -0.102 0.322
Winning Minutes -0.0026 0.001 -2.520 0.012 -0.005 -0.001
Winning Double Fault -0.0135 0.008 -1.632 0.103 -0.030 0.003
Winning Serve Games 0.0182 0.010 1.787 0.074 -0.002 0.038
Winning Break Points Saved 0.0134 0.016 0.814 0.416 -0.019 0.046
Winning Break Points Faced -0.0204 0.014 -1.512 0.131 -0.047 0.006
Losing Minutes 0.0053 0.001 5.111 0.000 0.003 0.007
Losing Double Fault 0.0134 0.007 1.853 0.064 -0.001 0.028
Losing Serve Point -0.0094 0.008 -1.199 0.231 -0.025 0.006
Losing First Point Won 0.0188 0.015 1.262 0.207 -0.010 0.048
Losing Second Point Won 0.0180 0.015 1.184 0.237 -0.012 0.048
Losing Serve Games -0.0317 0.033 -0.974 0.331 -0.096 0.032
Losing Break Points Saved -0.0810 0.047 -1.721 0.086 -0.174 0.011
Losing Break Points Faced 0.0679 0.045 1.507 0.133 -0.021 0.156
Omnibus: 33.206 Durbin-Watson: 1.658
Prob(Omnibus): 0.000 Jarque-Bera (JB): 37.962
Skew: 0.587 Prob(JB): 5.71e-09
Kurtosis: 3.503 Cond. No. 4.98e+03


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.98e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

Looking at the coefficients, they make sense. For example, winning minutes has a negative coefficient: taking longer to win your matches may mean you are a less dominant player, and thus have a lower winrate.

However, our R-squared value is quite low, making this a bad model. This could be a result of the statistics being so tightly bunched; as the matrix plot showed, it's hard to find correlations when all the points cluster together.

We will continue, keeping only the low p-value predictors. With an alpha (significance level) of 0.05, only two predictors satisfy this threshold.

In [30]:
# Refit the model using only the two predictors with p-values below 0.05
X = playersDataframe[['Winning Minutes','Losing Minutes']]
y = playersDataframe['Winrate']

X = sm.add_constant(X)
est = sm.OLS(y, X).fit()
est.summary()
Out[30]:
OLS Regression Results
Dep. Variable: Winrate R-squared: 0.177
Model: OLS Adj. R-squared: 0.174
Method: Least Squares F-statistic: 59.82
Date: Sun, 27 Feb 2022 Prob (F-statistic): 2.95e-24
Time: 22:23:24 Log-Likelihood: 553.03
No. Observations: 559 AIC: -1100.
Df Residuals: 556 BIC: -1087.
Df Model: 2
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 0.1907 0.047 4.084 0.000 0.099 0.282
Winning Minutes -0.0023 0.000 -5.346 0.000 -0.003 -0.001
Losing Minutes 0.0052 0.000 10.935 0.000 0.004 0.006
Omnibus: 25.389 Durbin-Watson: 1.711
Prob(Omnibus): 0.000 Jarque-Bera (JB): 27.590
Skew: 0.527 Prob(JB): 1.02e-06
Kurtosis: 3.269 Cond. No. 1.79e+03


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.79e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

Eliminating all of the predictors with p-values above a significance level of 0.05 leaves us with just two. Funnily enough, this tanks our R-squared, making our already weak model worse.

Looking at the scatter plots, I was already skeptical that a linear model could predict winrates. Still, as a tennis player myself, I was somewhat optimistic: I hoped the stats could show what makes good players good, and which ones I could focus on to improve my own winrate.

The only results that make sense are that the faster you win your matches, the better your winrate, and the longer it takes you to lose, the better your winrate.
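To make the two surviving coefficients concrete, consider a hypothetical player whose wins average 90 minutes and whose losses average 110 minutes; plugging into the fitted model:

```python
# Fitted coefficients from the two-predictor model above
const, b_win_min, b_lose_min = 0.1907, -0.0023, 0.0052

# Hypothetical player: wins take 90 minutes on average, losses 110
predicted_winrate = const + b_win_min * 90 + b_lose_min * 110
print(round(predicted_winrate, 4))  # 0.5557
```

So quick wins and drawn-out losses both push the predicted winrate up, consistent with the interpretation above.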

Conclusion: with the currently publicly available stats, I can't build a linear model that predicts wins. It seems like the stats people are choosing to document don't really matter for winrate. With more sophisticated stats, such as the shot selection or percentage-in figures that professionals currently use, I might get more meaningful results.

After all this doom and gloom, I wanted to satisfy my fan perspective by comparing the best of the best against each other: Federer, Nadal, and Novak.

In [32]:
# Compare the Big Three: career-to-date averages of their stats in wins
novak = combined.loc[combined['winner_name'] == 'Novak Djokovic']
nadal = combined.loc[combined['winner_name'] == 'Rafael Nadal']
federer = combined.loc[combined['winner_name'] == 'Roger Federer']

statColumns = ['Year', 'Winning Minutes', 'Winning Aces', 'Winning Double Fault',
               'Winning Serve Point', 'Winning First In', 'Winning First Point Won',
               'Winning Second Point Won', 'Winning Serve Games',
               'Winning Break Points Saved', 'Winning Break Points Faced']

playerStats = {}
for name, playerWins in [('Novak', novak), ('Nadal', nadal), ('Federer', federer)]:
    rows = []
    for year in range(2000, 2022):
        # Average over all of the player's wins recorded before January 1 of `year`,
        # so each point is a cumulative (career-to-date) average.
        # numeric_only=True excludes the date columns from the mean.
        wins = playerWins.loc[playerWins['tourney_date'] < datetime.datetime(year, 1, 1)]
        winAvgs = wins.mean(numeric_only=True)
        rows.append((year, winAvgs['minutes'], winAvgs['w_ace'], winAvgs['w_df'],
                     winAvgs['w_svpt'], winAvgs['w_1stIn'], winAvgs['w_1stWon'],
                     winAvgs['w_2ndWon'], winAvgs['w_SvGms'],
                     winAvgs['w_bpSaved'], winAvgs['w_bpFaced']))
    playerStats[name] = pd.DataFrame.from_records(rows, columns=statColumns)

for name, overTime in playerStats.items():
    plt.plot(overTime['Year'], overTime['Winning Minutes'], label=name)
plt.xlabel("Year")
plt.ylabel("Minutes")
plt.title("Number of Minutes to Win over Time")
plt.legend()
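Note that the loop above accumulates every win before each January 1, so the curves show career-to-date averages rather than per-season values. If per-year averages were wanted instead, a `groupby` on the year is simpler; a toy sketch with made-up match data standing in for `combined`:

```python
import pandas as pd

# Hypothetical mini match log: two wins per year for one player
matches = pd.DataFrame({
    'tourney_date': pd.to_datetime(['2019-03-01', '2019-06-01',
                                    '2020-02-01', '2020-05-01']),
    'winner_name': ['Novak Djokovic'] * 4,
    'minutes': [100, 120, 90, 110],
})

# Group wins by calendar year and average match length within each year
yearly = (matches.assign(Year=matches['tourney_date'].dt.year)
                 .groupby('Year')['minutes'].mean())
print(yearly)  # 2019 -> 110.0, 2020 -> 100.0
```

The cumulative version smooths out season-to-season swings, which is why it was used for the plot above, but the per-year version would make a late-career decline easier to spot.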
Out[32]:
<matplotlib.legend.Legend at 0x7f72663cbdd0>